(57) Abstract

A method for navigating through video matter by means of displaying a plurality of
key-frames in parallel, whilst allowing selective accessing of displayed keyframes for
thereupon controlling actual access to said video matter as representing a mapping of so
accessed keyframes, said method being characterized by allowing within a single user
interface organization to select between a first operative mode for arranging keyframes in a
temporally ordered manner on the screen and a second operative mode for arranging
keyframes with multiple selectable granularities between contiguous keyframes as displayed.

A method and device for navigating through video matter by means of displaying a plurality
of key-frames in parallel.



BACKGROUND OF THE INVENTION 

The invention relates to a method according to the preamble of Claim 1. 
The usage of keyframes as representative parts of a video presentation, that is recorded for 
subsequent selective playback, has been proposed elsewhere. A continuous video stream 
means that video remains "on", which may include animation, a series of stills, or an
interactive sequence of images. The character may be various, such as film, news, or for
example a shopping list. State of the art is represented by the article 'Content-Based Video
Indexing and Retrieval' by S.W. Smoliar and H.J. Zhang, IEEE Multimedia, Summer 1994,
pages 62-72. 

Keyframes may be derived from video material upon its reception at the
user's premises through a derivation algorithm, or keyframes may be labelled as such by a video
provider, for example, in that each video shot will start with a keyframe. A third scheme is
that the frames succeed each other with uniform time intervals as relating to standard video
speed. The present invention recognizes that keyframes should be utilized so as to give users
a dynamic overview over the presentation, combined with useful facilities for enabling them
to access the material more easily, for selecting or deselecting for subsequent display, or for
editing.

A particular problem with present-day projects for digital and compressed
coding of video images is that storage thereof on mass media generally does not allow
immediate access thereto, in particular in that the linear storage density such as expressible
in frames per storage size is non-uniform. It has been proposed to supplement a
high-capacity main storage medium such as tape with a secondary storage medium with smaller
capacity and enhanced accessibility. In that case, the execution of trick modes, such as fast
forward and fast reverse, as well as editing of the video material for subsequent presentation
in an abstracted, modified, or rearranged form give rise to appreciable difficulties, both as
seen from the aspect of the user interface, as well as perceived from the aspect of storage
technology.



SUMMARY TO THE INVENTION

In consequence, amongst other things, it is an object of the present
invention to introduce more flexibility into the organization, as well as to present a user
with a more natural feeling of the storage organization as well as of the video material
proper, whilst obviating the need to continually access the main storage medium. Now
therefore, according to one of its aspects the invention is characterized according to the
remainder of Claim 1. Presenting the frames in a temporally ordered manner allows fast
forward and fast reverse to be effected in a simple manner, for example if the frames succeed
each other with uniform time intervals as relating to standard video speed. Furthermore, the
easy change of hierarchical level with variable granularity in time between the frames allows
easy accessing and editing. The same is true if the keyframes or at least a fraction thereof
derive from film shot commencements, or from other relevant events generated by the original
film editor. In this manner, a clustering operation may be effected automatically.

Advantageously, the method may include highlighting a presently selected
keyframe by enlarging it at a multiple-sized format with respect to other keyframes, whilst
furthermore including detecting deleterious video interlacing effects and if so, reducing such
effect by vertical decimation and/or including applying an upsampling filter to the image
before display. Whereas video distortions in relatively smallish keyframes have been
experienced as tolerable, if a particular keyframe is enlarged, extra measures should be taken
for picture improvement. The inventor has recognized that this upgrading, although not
always attaining the quality level present under standard rendering conditions, gives a
pleasant and instructive improvement of picture quality.

The invention also relates to a device arranged for implementing the steps
of the method as recited. Further advantageous aspects of the invention are recited in
dependent Claims.

BRIEF DESCRIPTION OF THE DRAWING 

These and other aspects and advantages of the invention will be discussed
in more detail with reference to the disclosure of preferred embodiments hereinafter, and in
particular with reference to the appended Figures that show:

Figure 1, a block diagram of a TV-Recorder combination; 

Figure 2, an exemplary structure of a video recording; 

Figure 3, a design of a scrolling mosaic user interface; 

Figure 4, a design of a scrolling list user interface; 
Figure 5, a more extensive graphical user interface;

Figure 6, the presentation of subtitles; 

Figure 7, a state diagram of the system operation. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Various advantages, in particular, but not exclusively pertaining to an
ordinary customer and private home use, imply the following:

• The keyframes must be presented in such a manner that they are distinguishable
from each other by a user person located at a typical TV viewing distance.

• The number of keyframes presented simultaneously should be sufficient for
providing a user person with an overview of a significant part of the contents of
the digital video material.

• The keyframes should be displayed in an undistorted fashion such as by
retaining aspect ratio.

• Preferably, the remote control device of the TV set itself operates as user
control device.

• Feedback information should be perceivable from a typical viewing distance.

• Computer concepts such as "drag and drop" are generally not necessary.

• It must be feasible that the facilities be used only occasionally, rather than
continually.

• The user interface should reflect the familiar linear model of a video
presentation.

DISCLOSURE OF A PARTICULAR EMBODIMENT 

Figure 1 is a diagram showing a TV-Recorder combination according to
the invention. Item 20 represents the TV-set display and associated immediate control and
powering. Item 22 represents an antenna, or a connection with another type of signal
distribution entity, such as cable distribution. This item includes, if appropriate, also the
derivation of the digital video information or the digital signal part from the received signal.
Item 34 represents the routing of the video streams and associated information between the
various subsystems of Figure 1. The routing is governed by control box 28 through control
signals on line 35. The latter has been drawn as a single bidirectional interconnection but
may in fact be built from any number of unidirectional or bidirectional lines. The control box
receives detection signals from display 20 on line 30 and from further subsystems 38, 40,
whilst also controlling the latter two. Block 38 is a linear tape recorder with a very high
storage capacity in the multi-gigabyte region. Block 40 is a magnetic disc recorder with a
high storage capacity, but which is nevertheless only a fraction of that of recorder 38; on the
other hand, access in recorder 40 is much faster through cross-track jumping. Together,
blocks 38 and 40 form a two-level storage organization that is somewhat akin to a computer
memory cache system, and stores all items of a video presentation at least once. Item 24
represents a remote control device that by way of wireless link 26 communicates with display
device 20, and indirectly with subsystem 28 and further subsystems 38 and 40.

Figure 2 shows an exemplary structure of a video presentation. For
effecting the video matter functionality, bar 60 contains the video itself, either in the form of
frames, or as a string of compressed video matter, such as MPEG-coded. The information is
stored along the bar as video time progresses, although actual storage requirements need not
be uniform over replay time. Interspersed keyframes have been indicated by dark vertical
stripes such as 68. A keyframe is used as representing, or as being typical of the overall
video in the interval up to the next keyframe. The keyframes may be singled out by a video
provider as the first frame of each new shot through adding a label or inclusion in a "table of
contents" (TOC). Alternatively, the receiver, through some algorithm, detects that the video
content changes abruptly from one frame to the next. The present invention takes the
associated algorithms for granted. As shown, their distribution may be non-uniform. A
further mechanism is that successive keyframes succeed each other at prescribed intervals,
such as every 2-3 seconds. In the embodiment, at indication 62 only the keyframes are
represented. Furthermore, the keyframes are organized in some hierarchy, in that indication
64 has only a limited set of highly relevant keyframes. This hierarchization may be
multi-level, in that indication 66 is associated with only a single keyframe for all of the video
presentation 60. The various levels of keyframes may be determined in different ones of the
organizations recited supra, and may even exist side by side.
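
By way of illustration only, the hierarchy of Figure 2 may be sketched in Python as follows.
The sketch assumes that each keyframe carries a time stamp and a level number, level 0 being
the single keyframe of indication 66 and higher numbers being finer levels such as indication
62. The names Keyframe, visible_at and interval_of are hypothetical and do not appear in the
disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyframe:
    time_s: float  # position in the presentation, in seconds
    level: int     # 0 = coarsest level; larger numbers = finer levels (assumed coding)


def visible_at(keyframes, granularity):
    """Return the keyframes shown at the given granularity, in temporal order."""
    return sorted((k for k in keyframes if k.level <= granularity),
                  key=lambda k: k.time_s)


def interval_of(keyframes, granularity, index, end_s):
    """Interval represented by the index-th visible keyframe: up to the next one."""
    shown = visible_at(keyframes, granularity)
    start = shown[index].time_s
    stop = shown[index + 1].time_s if index + 1 < len(shown) else end_s
    return start, stop
```

Stepping to a coarser granularity lengthens the interval that each displayed keyframe stands
for, which is the behaviour described for indications 62, 64 and 66.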

The storage mapping on Figure 1 may be effected in that the main body
of the video presentation is stored in tape recorder 38, whereas at least the keyframes are
reproduced in disc recorder 40, possibly together with short video and/or audio intervals
immediately following the associated keyframe. The length of such interval may correspond
to the time latency of linear tape recorder 38, so that thereby real-time access may be
attained. By itself, the video presentation may be essentially linear, such as a film.
Alternative usage is that certain storage intervals may contain animation, stills, or other
images to be used by a consumer present. A possible influencing of a keyframe is to
suppress it. This effectively joins the time interval before the keyframe in question with the
time interval behind it. A reset feature may again disjoin the interval. Also, various classes
of keyframes may be suppressed, such as the class whose members are separated from each
other by a fixed time interval. Various different classes of keyframes could be available for
one presentation, such as those introduced by the provider versus those that are generated by
a local algorithm at reception.
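
The effect of suppressing a keyframe can be sketched as follows, purely as an illustration.
Keyframes are assumed to be given as a list of times in seconds, and the function name
segments and its parameters are hypothetical.

```python
def segments(keyframe_times, end_s, suppressed=frozenset()):
    """Split the presentation into intervals, one per non-suppressed keyframe.

    Suppressing a keyframe joins the interval before it with the one behind it.
    Calling the function again without the suppressed set acts as the reset.
    """
    active = [t for t in sorted(keyframe_times) if t not in suppressed]
    bounds = active + [end_s]
    return list(zip(bounds[:-1], bounds[1:]))
```

A whole class of keyframes, such as those at fixed time intervals, could be suppressed by
passing all of their times in the suppressed set.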

Figure 3 shows a design of a scrolling mosaic user interface. Every
screen presents 20 keyframes starting from top left to right down: each keyframe has its
number in the overall ranking of keyframes shown. Currently, keyframe 144 is highlighted by
a rectangular control cursor. A user person may activate a remote control to move the cursor
freely over the keyframes displayed, as well as over the buttons displayed at the top and
bottom bars through the navigational controls on the cursor device. If the user moves the
control cursor to the left in the top left corner, the display jumps back by 20 keyframes.
Moving to the right in the lower right hand corner will cause a forward jump over 20
frames. Accessing the top bar of the screen will control accessing other parts of the
presentation, in that the latter is divided into five equally long parts: a black horizontal bar
indicates the total time covered by the twenty keyframes displayed here, of the overall
presentation.
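
The paging behaviour of the mosaic may be sketched as below. This is a minimal sketch under
stated assumptions: only left and right key presses are modelled, and after a jump the cursor
is assumed to keep its screen position (clamped to the last keyframe available), which the
disclosure does not specify. The name move_cursor is hypothetical.

```python
PAGE = 20  # keyframes per screen, as in Figure 3


def move_cursor(first, cursor, direction, total):
    """Return (first, cursor) after a left/right key press.

    first  -- overall index of the keyframe shown at top left
    cursor -- position of the cursor on the screen, 0..PAGE-1
    total  -- number of keyframes in the presentation
    """
    if direction == "left":
        if cursor > 0:
            return first, cursor - 1
        if first >= PAGE:  # top left corner: jump back by one screen
            return first - PAGE, cursor
    elif direction == "right":
        last_on_screen = min(PAGE, total - first) - 1
        if cursor < last_on_screen:
            return first, cursor + 1
        if first + PAGE < total:  # lower right corner: jump forward by one screen
            new_first = first + PAGE
            # Assumption: keep the screen position, clamped to the last keyframe.
            return new_first, min(cursor, min(PAGE, total - new_first) - 1)
    return first, cursor  # no move possible
```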

Other functions are initiated by first selecting a particular keyframe and
subsequently one of the bottom buttons. "View program" controls a start at a cursor-accessed
keyframe. "View segment" does the same, but plays only a single segment, that will end at
the next keyframe. "View from x to y" controls a start at the earliest in time of two
cursor-accessed keyframes, and stops at the last in time of the two. Other modes are feasible
together with the keyframe-selecting functionalities. Examples are fast-forward or
slow-forward, that allow a user person to check a particular interval for certain occurrences, or
fast/slow reverse to attain certain video effects. During display, upon passing the instant in
time pertaining to a particular keyframe, the latter becomes active and effectively displays
the video stream, until arriving at the instant associated with the next keyframe. Thereupon,
the latter becomes the active frame. The above feature allows a user to straightforwardly
program a video recorder for an interval display sequence such as by leaving out certain
segments, such as advertising, or rather, to draw attention to certain details by means of
slow-forward. During the display, audio may be active or suppressed through a control
button not shown. Alternatively, control may let audio go on, whereas the video cursor is
discrete, in that it steps only from interval to interval through appropriate highlighting.
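
As an illustration, the three viewing buttons can be mapped onto playback ranges as in the
sketch below. The command strings and the function name play_range are hypothetical, and it
is assumed that "View from x to y" stops at the instant of the later keyframe.

```python
def play_range(command, times, end_s, selected):
    """Return (start_s, stop_s) for one of the bottom-bar commands.

    times    -- sorted keyframe times in seconds
    end_s    -- end of the presentation in seconds
    selected -- selected keyframe indices (one, or two for "view_from_x_to_y")
    """
    if command == "view_program":
        return times[selected[0]], end_s
    if command == "view_segment":
        i = selected[0]
        stop = times[i + 1] if i + 1 < len(times) else end_s
        return times[i], stop
    if command == "view_from_x_to_y":
        a, b = sorted(selected)  # earliest keyframe first, whatever the selection order
        return times[a], times[b]
    raise ValueError(f"unknown command: {command}")
```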

Figure 4 is an exemplary design of a scrolling list user interface. Here, at
its bottom the overall screen has five keyframes displayed, keyframe 145 being highlighted
by a rectangular control cursor that runs along its edges. Keyframe 145 is also displayed at a
larger magnification in the background. The control interface is the same as in Figure 3,
although the button positions are different. Alternatively, the enlarged keyframe is suppressed
in the multi-keyframe bar.

Figure 5 shows a more extensive graphical user interface. First, left and
right are columns of control buttons for play, stop, select, cut, paste, fast reverse, zoom+,
zoom-, fast forward. The bottom row has a sequence of nine keyframes that pertain to
respective different scenes or shots, in that they have hardly any correlations therebetween.
Through stepping in the hierarchical organization of the keyframes, a good overview on the
scene-to-scene dynamics may be gathered. The inter-keyframe distance could be, for
example, ten seconds, but greater and smaller spacings could be feasible. Especially with
short distances between successive keyframes in time, features such as fast forward can be
well realized. On the other hand, the same size of spacing could be used for full playback of
all audio, whilst the video would only jump from one keyframe to the next. Now, the central
keyframe is also represented in an enlarged manner. When playing closely spaced keyframes
that have low enough granularity, the enlarged keyframe may be presented in a dynamic
manner, for so effecting fast forward (or backward) mode. Upon reaching the material of the
next keyframe, here showing a sailing vessel, the bottom row shifts one position to the left,
so that the "sun" at left becomes obscured and a new keyframe enters from the right. Such
display could in particular be at a faster frame rate than standard video, as mapped on the
presentation from background storage medium. The reverse organization allows for fast
reverse.

Figure 6 shows the presentation of subtitles, in the general format as
discussed with reference to Figure 5. In the central field, space 50 has been devoted to the
actual frame; space 52 has been devoted to displaying subtitles derived from, or associated to
the video presentation, or to other relevant information, such as speech-to-text converted for
the deaf, or a translation into another language than the one used for actual speech. It would
not be necessary that the subtitles derive only from the range associated to the seven
keyframes at the screen bottom. Their relevance could stretch much further. Further, each
keyframe has a time code 54 or other relevant data overlaid thereon. The two columns of
control buttons 56, 58 have been devoted to application operations at left, and intra-program
operators at right. The top of the screen has the title 60 of the actual video program
displayed.

The rationale of the arrangements for having a dynamic representation of
the video cursor that runs in time in the actually active keyframe field, is that the static
representation of the keyframes alone does not sufficiently convey the dynamics of the video
representation as a whole, when such dynamics let the user better understand the evolving of
the events. Therefore, the semantics are enhanced as follows. After the system has been idle
for a certain amount of time, the keyframe that the cursor 'encloses' will 'come alive',
because it will start playing the digital video material in miniature, including any associated
audio and further effects. If during the playback, the next keyframe is reached, the cursor
will automatically 'jump' to the next keyframe presented in the user interface, until the user
will (re)start interacting with the system. In general, the organization described herein will
allow browsing through information that is different and separate from the overall video
string. Even if only the audio is played in a dynamic manner, whilst jumping from any
keyframe to the next, the user gets a better impression of the underlying video, at
particularly low storage requirements.

In this respect, Figure 7 is a state diagram of the system operation. In
state 100, the system awaits input from the user, while displaying the multiple keyframes.
Such input may imply jumping among the displayed keyframes, jumping to another set of
keyframes, or selecting a keyframe for displaying the associated interval. Any such input
effects arrow 104 and starts a new time interval. Absence of any such input during n seconds
(such as 20 seconds) effects arrow 108, so that state 102 is reached. Therein, the system runs
the dynamic video cursor frame. As long as no user input is received, arrow 110 is effected, and
the system continues as long as displayable video material is available. If user input is
received however, arrow 106 is effected, and the system freezes, either at the actual content
of the dynamic video cursor frame, or at the beginning of the actual interval.
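
The state diagram of Figure 7 may be sketched as a small transition function, given here as
an illustration only. The state and arrow numbers are those of the Figure; the function name
step, the tick-based timing and the return to state 100 when no displayable video remains are
assumptions of the sketch.

```python
IDLE_TIMEOUT_S = 20          # "n seconds (such as 20 seconds)"
WAITING, PLAYING = 100, 102  # state numbers from Figure 7


def step(state, idle_s, user_input, video_left=True):
    """Return (new_state, arrow) for one tick of the state machine.

    idle_s     -- seconds elapsed since the last user input
    user_input -- True if the user pressed a key during this tick
    arrow      -- Figure 7 transition taken, or None if none applies
    The caller is expected to reset its idle timer on arrows 104 and 106.
    """
    if state == WAITING:
        if user_input:
            return WAITING, 104   # handle the input, start a new time interval
        if idle_s >= IDLE_TIMEOUT_S:
            return PLAYING, 108   # start the dynamic video cursor
        return WAITING, None
    if state == PLAYING:
        if user_input:
            return WAITING, 106   # freeze and await further input
        if video_left:
            return PLAYING, 110   # keep running the dynamic video cursor
        return WAITING, None      # assumption: fall back to waiting at end of video
    raise ValueError(f"unknown state: {state}")
```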

DETECTION AND FILTERING OF KEYFRAMES AFFECTED BY "INTERLACING"
EFFECT 

Some of the keyframes used to browse the content of the video program
may have been extracted from a sequence with high motion. This produces an annoying
zig-zag effect in case the video sequence was encoded with interlaced coding mode as normally
is the case, a frame being made up of two fields which contribute to form the complete
frame, where even lines belong to one field, odd lines to the other. This problem is more
evident and annoying in small keyframes, where the effect is more visible, when the picture
is magnified and the lines become thick blocks.

First, the keyframes affected by such an interlacing effect must be
detected. This effect can be observed on the rows of the image, where luminance variations
cause high frequency values. This can be exploited by splitting the spatial frequency
spectrum into many sub-bands and then considering only the high frequency components.
Actually, the effect we want to detect must present alternating values of luminance between
even and odd lines, and therefore lies at the highest vertical frequency that the sampling of
the resulting picture can represent. The
only coefficient that has to be computed is the highest frequency component of a frequency
transformation (FFT or, better, DCT) on the columns. When the picture is affected by the
zig-zag effect due to interlacing, this component will have a high value.

However, this effect will also be visible in correspondence of an object
with motion, especially with components in the horizontal direction. Therefore we should not
consider the total sum of the coefficients, as this would also yield high values in a picture
with detailed and contrasted patterns and finally produce false positives.
A better result can be obtained by splitting the image into several sub-parts, and considering
the greatest value per area. For example, by summing the two highest values of each area, the
overall sum will be less susceptible to highly detailed images.
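
The detection may be sketched as follows, as an illustration under stated assumptions. For
simplicity the sketch uses the highest-frequency (Nyquist) component of a DFT on each column,
which is the alternating sum of the even and odd lines, rather than the DCT preferred above.
The area size of 8 by 8 and the function names are hypothetical, and the decision threshold
to be applied to the score is left open.

```python
def nyquist_energy(column):
    """Magnitude of the alternating sum x[0] - x[1] + x[2] - ... of one column.

    For an even number of samples this is |X[N/2]|, the highest-frequency DFT bin.
    """
    return abs(sum(v if n % 2 == 0 else -v for n, v in enumerate(column)))


def interlace_score(image, area_h=8, area_w=8):
    """Sum, over all areas, of the two largest per-column values within the area.

    image -- list of rows, each a list of luminance values
    """
    height, width = len(image), len(image[0])
    score = 0
    for top in range(0, height, area_h):
        for left in range(0, width, area_w):
            values = []
            for x in range(left, min(left + area_w, width)):
                column = [image[y][x]
                          for y in range(top, min(top + area_h, height))]
                values.append(nyquist_energy(column))
            score += sum(sorted(values, reverse=True)[:2])
    return score
```

A picture whose lines alternate strongly between two luminance values yields a large score,
whereas a flat picture yields zero.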

Finally, as a lower vertical resolution is less annoying than the zig-zag
effect, the simplest way of filtering this image is to consider only one field and then
upsample it vertically by a factor of 2. An interpolating filter, as mentioned in the
following section, can be applied before showing the resulting picture.

So, the detection and correction are effected as follows. The first step is
to discard one field by removing half of the rows, either even or odd; then, an upsampling
by a factor of 2 is performed along the rows in order to recover the original size of the
keyframe, followed by an interpolating filter. In this case, the filter performs a simple linear
interpolation.
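
The correction can be sketched in a few lines, given here only as an illustration. The sketch
keeps the even lines (either field could be kept), rebuilds each missing line as the average
of its two neighbours, and repeats the last kept line at the lower border; the border handling
and the function name deinterlace are assumptions.

```python
def deinterlace(image):
    """Keep the even field and rebuild the odd lines by linear interpolation.

    image -- list of rows, each a list of luminance values
    """
    field = image[0::2]  # discard the odd field
    out = []
    for i, row in enumerate(field):
        out.append(list(row))
        # Next kept line, or the same line again at the lower border.
        nxt = field[i + 1] if i + 1 < len(field) else row
        out.append([(a + b) / 2 for a, b in zip(row, nxt)])
    return out[:len(image)]  # recover the original number of lines
```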

UPSAMPLING AND INTERPOLATION 

To be easily visible from a TV-viewing distance, the keyframe has to be
enlarged to almost full screen size by means of an upsampling followed by an interpolating
filter. Since the keyframe generally has a low resolution, it has to be enlarged by quite a
high factor. This means that if it is not further processed, the result would not be
good-looking, as pixels become large blocks. Therefore the picture must be filtered, but a
trade-off must be found since we need to generate a good quality picture to be shown at high
resolution, but also fast processing for the application to have a short response time. The
issue is that the enlargement must be performed on the fly on the picture, which means that
the image cannot be enlarged and filtered just once, to be stored on the hard disk and re-used,
because it would require too much storage space. Therefore the upsampling and filtering
process must be as fast as possible while maintaining at the same time an acceptable result.
Normally a usual interpolating filter may be employed (cf. any book on Digital Signal
Processing; a relevant paper can be: H.C. Andrews, C.L. Patterson, Digital Interpolation of
Discrete Images, IEEE Trans. Comput. 1976, v25, pages 196-202).
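
As one fast option among the usual interpolating filters, a separable linear interpolation
may be sketched as below. This is an illustration only, not the filter of the cited paper;
the function names are hypothetical, and an image of n samples per line is enlarged to
(n - 1) * factor + 1 samples.

```python
def upsample_line(line, factor):
    """Insert factor-1 linearly interpolated samples between neighbouring samples."""
    out = []
    for a, b in zip(line, line[1:]):
        out.extend(a + (b - a) * k / factor for k in range(factor))
    out.append(line[-1])
    return out


def upsample_image(image, factor):
    """Enlarge an image by interpolating along the rows, then along the columns."""
    rows = [upsample_line(r, factor) for r in image]
    cols = [upsample_line(list(c), factor) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Because the filter is separable and uses only two neighbouring samples per output sample, it
can be applied each time a keyframe is enlarged, without storing the enlarged picture.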

Alternative techniques to improve the image quality can be used as well.
Amongst them, wavelet-based solutions and fractal approaches seem to lead to a higher
computational burden, but show outstanding results in visual quality. In fact, fractal
compression techniques are well-known to be resolution independent: the details on a higher
resolution can be reconstructed or simulated by applying the same decoding process
iteratively. In this case what will be stored is a fractal compressed picture, yielding a high
compression factor. Similarly, by using wavelet transformation, high frequency components
on higher scales can be predicted to obtain a higher resolution image without blurring effects.

TEXTUAL SEARCH ON VIDEO PROGRAMMES BASED ON SUBTITLES 

In current video transmissions, subtitles are often transmitted along with
the program (often in the Vertical Blanking Interval for analog systems or in a separate
elementary stream in digital transmissions). This is normally used for programs distributed in
foreign languages and not synchronized, or is meant for persons with hearing disability. Such
information is normally superimposed on the screen, but could also be recorded on a storage
medium. In this way, the speech of the program, and sometimes also some description of the
sound for deaf people, is available for search operations.

The extraction of this kind of information should happen in real time,
while the program is being recorded. If this technique is coupled to the keyframe extraction
routines, we may link the picture to the related text, i.e. the dialogue that takes place in that
part of the program from which the keyframe has been extracted. In this way, with current
text retrieval techniques, we can perform text retrieval based on specific keywords. A
specific tool of the application will offer the possibility to perform simple queries based on
keywords and their composition, as now commonly used in "Web" search engines.

As an example, suppose a news program has been recorded. If we intend
to retrieve news regarding France, when the word "France" is inserted, the system will
automatically look for this word in the text of the program. If the result is positive, the user
will be presented with the keyframe related to that part of the program and the specific part
of subtitles where the keyword was found. The user can then start watching the program
starting from that particular point. If more keyframes are found as result of the query, they
will be all shown on the bottom of the screen, as in Figure 5, so that the user can analyze
the related text one by one on the larger window. Of course similar keywords can be used
(French, Paris) if the result was negative. This system can also be useful in sports programs
to extract reports covering a specific team or sport.
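
The keyword query of the example may be sketched as follows, as an illustration only. It is
assumed that the recorder has stored, for each keyframe, the subtitle text of the associated
interval; the names Entry, words and search are hypothetical, and several keywords are
combined with a logical AND on whole words.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    keyframe: int   # index of the keyframe
    time_s: float   # start of the associated interval, in seconds
    subtitles: str  # subtitle text recorded for that interval


def words(text):
    """Lower-cased set of the words occurring in a subtitle text."""
    return set(re.findall(r"\w+", text.lower()))


def search(index, *keywords):
    """Return the entries whose subtitles contain every keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    return [e for e in index if wanted <= words(e.subtitles)]
```

Each returned entry gives the keyframe to be shown at the bottom of the screen, the text to be
shown in the larger window, and the time at which playback can start.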

Many other applications are feasible, for example to check whether a
movie is to be allowed for children's viewing, by checking whether or not the words used in
the dialogues are included in a list of "bad words".

Possible extensions of such a system include:
o extracting the text from the screen, such as by OCR techniques on still pictures,
if the text is not available separately from the video;
o using speech recognition technology to extract the dialogues from the program.
In this case the system will always be independent of the service offered by
the broadcaster, so that even in case no subtitle is provided, text retrieval will
always be possible at least on some specific keywords that the system can be
trained to recognize.
CLAIMS:

1. A method for navigating through video matter by means of displaying one
or more series of a plurality of key-frames in parallel, whilst allowing selective accessing of
displayed keyframes for thereupon controlling actual access to said video matter as
representing a mapping of so accessed keyframes,
said method being characterized by allowing within a single user interface
organization to select between a first operative mode for arranging keyframes in a temporally
ordered manner on the screen and a second operative mode for arranging keyframes with
multiple selectable granularities between contiguous keyframes as displayed.

2. A method as claimed in Claim 1, and whilst in said temporally ordered
manner progressively playing back an audio interval associated to a temporally centered
keyframe.

3. A method as claimed in Claim 2, wherein successive audio intervals will
constitute a substantially continuous audio representation with respect to a sequence of
discretely spaced keyframes.

4. A method as claimed in Claim 1, and in the second operative mode
playing back an audio interval associated to an actually accessed keyframe.

5. A method as claimed in Claim 1, characterized by highlighting a presently
selected keyframe whilst simultaneously enlarging it at a multiple-sized format with respect
to other keyframes, the method furthermore including detecting deleterious video interlacing
effects and if so, reducing such effect by vertical decimation.

6. A method as claimed in Claim 1, characterized by highlighting a presently
selected keyframe whilst simultaneously enlarging it at a multiple format with respect to
other keyframes, the method furthermore including applying an upsampling filter to the
image before display.

7. A method as claimed in Claim 1, whilst furthermore displaying associated
to an actualized keyframe a subtitle or other relevant information extracted for an associated
keyframe or sequence of keyframes.

8. A device being arranged for executing a method as claimed in Claim 1.